The Digital Archive of Buddhist Temple Gazetteers and Named Entity Recognition (NER) in classical Chinese
Abstract
The identification of names and dates in larger corpora of historical texts is important for both traditional and digitally mediated research; it is part of reading as well as of exploring digital corpora. This paper introduces a number of issues concerning named entity recognition (NER) for classical Chinese. In particular, it presents the “Digital Archive of Buddhist Temple Gazetteers” (http://buddhistinformatics.ddbc.edu.tw/fosizhi/) as a benchmark corpus for NER on classical Chinese and illustrates how marked-up corpora can answer questions that could not otherwise be addressed. The “Digital Archive of Buddhist Temple Gazetteers” is an open-source, open-access archive of local histories of Chinese Buddhist sites. Names and dates were encoded with XML/TEI and associated with authority databases. The archive, which contains classical texts in a variety of genres, can serve as testing data for experiments in NER and POS tagging. The data is made available as part of the article. We also show that for classical Chinese even a custom-made person name dictionary, created during the markup of the corpus, cannot in turn be used to parse the same corpus successfully without further intervention.
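As a rough illustration of the last two points, the sketch below pulls person names and dates out of a TEI fragment and then reuses the harvested names as a lookup dictionary. It is not the archive's actual pipeline: the TEI fragment and the test phrase are invented for this example, the function names `extract_entities` and `dictionary_matches` are hypothetical, and greedy longest-match lookup is only one assumed way of "parsing with a dictionary".

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the fragment below is an invented sample,
# not text taken from the archive.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "<persName>慧遠</persName>於<date>太元九年</date>至廬山。"
    "</p>"
)


def extract_entities(xml_text):
    """Return (person names, dates) found in a TEI fragment."""
    root = ET.fromstring(xml_text)
    names = [el.text for el in root.iterfind(".//tei:persName", TEI_NS)]
    dates = [el.text for el in root.iterfind(".//tei:date", TEI_NS)]
    return names, dates


def dictionary_matches(text, names):
    """Greedy longest-match lookup; returns (start offset, name) pairs."""
    by_length = sorted(names, key=len, reverse=True)
    hits = []
    i = 0
    while i < len(text):
        for name in by_length:
            if text.startswith(name, i):
                hits.append((i, name))
                i += len(name)
                break
        else:
            i += 1  # no dictionary entry starts here; move one character on
    return hits


names, dates = extract_entities(SAMPLE)
plain = "".join(ET.fromstring(SAMPLE).itertext())

# The dictionary finds the name in the text it was harvested from ...
true_hits = dictionary_matches(plain, names)
# ... but also "finds" it inside 智慧遠大, where 慧遠 straddles two
# unrelated words (智慧 + 遠大) and names nobody.
false_hits = dictionary_matches("智慧遠大", names)
```

Because classical Chinese is written without word boundaries, a plain string match has no way of telling the two cases apart, which is one plausible reason a name dictionary alone needs "further intervention" such as context rules or manual review.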
Related resources
Token Gazetteer and Character Gazetteer for Named Entity Recognition
Named entity recognition (NER) in information extraction (IE) systems is usually based on large gazetteers, i.e. datasets of well-known and classified entities. NER is also often performed by an independent look-up component, which is considered a bottleneck of many NER systems. In this paper, we present two approaches for building tree gazetteers for NER: lookup by token and lookup by character.
Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity
In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s arch...
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...
Named entity recognition with document-specific KB tag gazetteers
We consider a novel setting for Named Entity Recognition (NER) where we have access to document-specific knowledge base tags. These tags consist of a canonical name from a knowledge base (KB) and entity type, but are not aligned to the text. We explore how to use KB tags to create document-specific gazetteers at inference time to improve NER. We find that this kind of supervision helps recognis...